Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization
We describe the Microbial Community Reconstruction (MCR) Problem, which
is fundamental for microbiome analysis. In this problem, the goal is to
reconstruct the identity and frequency of species comprising a microbial
community, using short sequence reads from Massively Parallel Sequencing (MPS)
data obtained for specified genomic regions. We formulate the problem
mathematically as a convex optimization problem and provide sufficient
conditions for identifiability, namely the ability to reconstruct species
identity and frequency correctly when the data size (number of reads) grows to
infinity. We discuss different metrics for assessing the quality of the
reconstructed solution, including a novel phylogenetically aware metric based
on the Mahalanobis distance, and give upper bounds on the reconstruction error
for a finite number of reads under different metrics. We propose a scalable
divide-and-conquer algorithm for the problem using convex optimization, which
enables us to handle large problems (with species). We show using
numerical simulations that for realistic scenarios, where the microbial
communities are sparse, our algorithm gives solutions with high accuracy, both
in terms of estimating species frequencies and in terms of species phylogenetic
resolution.
Comment: To appear in SPIRE 1
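The abstract above does not state the optimization problem itself, so the following is only a minimal sketch of one common way to pose frequency estimation as a convex program, not the authors' formulation or algorithm. It assumes a hypothetical matrix `A` whose column j holds the expected read profile of species j, and a vector `y` of observed read frequencies, and it minimizes the squared error over non-negative frequencies that sum to one using plain projected gradient descent. The names `project_to_simplex` and `estimate_frequencies` are illustrative.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that satisfies the test fixes the threshold
    return [max(x - theta, 0.0) for x in v]


def estimate_frequencies(A, y, steps=5000):
    """Minimize 0.5 * ||A x - y||^2 over the probability simplex.

    A is a list of rows (one row per read type, one column per species);
    y is the observed read-frequency vector. Both are assumptions made
    for this sketch.
    """
    m, n = len(A), len(A[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of
    # A^T A, so 1 / lipschitz is a safe (conservative) step size.
    lipschitz = sum(a * a for row in A for a in row) or 1.0
    x = [1.0 / n] * n  # start from the uniform community
    for _ in range(steps):
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i]
                    for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m))
                for j in range(n)]
        x = project_to_simplex([x[j] - grad[j] / lipschitz
                                for j in range(n)])
    return x


if __name__ == "__main__":
    # Toy example: three candidate species, only two actually present.
    A = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]
    y = [0.7, 0.3, 0.0, 1.0]
    print(estimate_frequencies(A, y))
```

The simplex constraint keeps the estimate interpretable as a frequency vector and tends to drive absent species to exactly zero, which fits the sparse communities the abstract mentions. A production solver would replace the fixed step count with a convergence test.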
Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information
Hierarchical clustering is a popular method for grouping similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decomposing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation of information metric. Our approach is general and does not require the user to specify the desired number of clusters. We apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For these two applications, we test the quality of our clusters by using them to predict complex and species membership, respectively. We find that our approach generally outperforms the commonly used heuristic methods.
© Springer-Verlag Berlin Heidelberg 2009
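The variation of information named in the abstract above has a standard definition, VI(A, B) = H(A|B) + H(B|A), and the sketch below computes it for two labelings of the same elements. It illustrates the metric only; it is not the authors' tree-decomposition procedure, and the function name `variation_of_information` and the list-of-labels input format are assumptions made for the example.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats.

    labels_a[i] and labels_b[i] are the cluster labels that the two
    clusterings assign to element i.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same elements")
    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)                  # cluster sizes in A
    count_b = Counter(labels_b)                  # cluster sizes in B
    count_ab = Counter(zip(labels_a, labels_b))  # overlap sizes

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # -p_ab*log(p_ab/p_a) contributes to H(B|A),
        # -p_ab*log(p_ab/p_b) contributes to H(A|B).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


if __name__ == "__main__":
    known = ["x", "x", "y", "y"]
    candidate = [1, 1, 2, 2]
    # Same partition under different label names.
    print(variation_of_information(known, candidate))
```

A value of zero means the two partitions are identical up to relabeling, and larger values mean greater disagreement, so a candidate cut of a dendrogram could be scored against known annotations by restricting both labelings to the annotated elements.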